JP920030092US1

METHODS, APPARATUS AND COMPUTER PROGRAMS FOR MANAGING
PERFORMANCE AND RESOURCE UTILIZATION WITHIN
CLUSTER-BASED SYSTEMS

FIELD OF INVENTION 

The present invention relates to methods, apparatus and computer programs for managing
performance or resource utilization, or both performance and resource utilization, for data
processing systems such as cluster-based systems.

BACKGROUND 

Cluster-based architectures are a useful platform for hosting many Internet applications 
such as Web servers or Web services. A cluster-based system includes a front-end 
(gateway) node connected by a local area network to a set of back-end nodes. The front-end
node receives requests and forwards each one to a back-end node where the actual
processing of the request takes place. There are many advantages to a cluster-based
system, including incremental scalability, increased availability and performance, cost 
control and maintenance. Hence, clusters are used for scalable Web servers, as described 
by M.Aron, D.Sanders, P.Druschel, and W.Zwaenepoel, in "Scalable Content-Aware 
Request Distribution in Cluster-based Network Servers", Proceedings of 2000 USENIX 
Annual Technical Conference, June 2000, and described by M.Aron, P.Druschel and 
W.Zwaenepoel in "Efficient Support for P-HTTP in Cluster-based Web Servers", 
Proceedings of 1999 USENIX Annual Technical Conference, June 1999. 

A specific technique for content-based request distribution within cluster-based network 
servers is described by V.S.Pai, M.Aron, G.Banga, M.Svendsen, P.Druschel, 
W.Zwaenepoel, and E.Nahum in "Locality-Aware Request Distribution in Cluster-based 
Network Servers", Proceedings of 8th ACM Conference on Architectural Support for 
Programming Languages and Operating Systems, October 1998. Locality-aware request
distribution (LARD) involves dividing data into partitions on the back-end servers and
using a front-end to distribute incoming requests in a manner that takes account of where
the data is stored. 

A cluster-based architecture is also suitable for the provision of Web services, as 
described by E.Casalicchio and M.Colajanni in "A Client-Aware Dispatching Algorithm 
for Web Clusters Providing Multiple Services", Proceedings of the 10th International 
World Wide Web Conference, May 2001. The advantages of clusters for Internet service 
provision are described by A.Fox, S.Gribble, Y.Chawathe, E.Brewer and P.Gauthier in 
"Cluster-Based Scalable Network Services", Proceedings of the Sixteenth ACM 
Symposium on Operating Systems Principles, October 1997. 

M.Welsh, D. Culler and E.Brewer, in "SEDA: An Architecture for Well-Conditioned, 
Scalable Internet Services", Proceedings of 18th Symposium on Operating Systems 
Principles (SOSP'01), October 2001, describe a framework for highly concurrent server 
applications which uses threading and aspects of event-based programming models to 
provide automatic tuning in response to large variations in load. The framework is 
referred to as the staged event-driven architecture (SEDA). Stages are separated by event 
queues and include controllers which dynamically adjust resource allocation and 
scheduling in response to changes in load. The size of each stage's thread pool is adjusted 
based on the monitored length of an event queue. 

Various mechanisms have been tried to improve the performance of cluster based 
systems - including caching, load balancing and client hand-off (e.g. TCP hand-off). The 
front-end may distribute the requests such that the load among the back-end nodes is 
balanced and the load may be distributed based on the client, request content, current 
resource usage or scheduling algorithms. Load distribution based on clusters is known to 
improve the scalability, availability and fault tolerance of Internet services. Various hand-off
mechanisms have been reported in the literature for request forwarding. However, these
typically require changes in the operating system and are not portable.

In "On the Use of Virtual Channels in Networks of Workstations with Irregular
Topology", IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 8,
August 2000, pages 813-828, F. Silla and J. Duato describe a flow control protocol for 
supporting a number of virtual channels in a network of workstations which implements 
adaptive routing. The flow control protocol uses channel pipelining and seeks to 
minimize control traffic. 

Published US Patent Application Nos. 2002/0055980, 2002/0055982 and 2002/0055983 
(Goddard) describe a server computer, such as a cluster-based Web server, having a 
plurality of persistent connections to a dispatcher. The dispatcher monitors the 
performance of the back-end server and, in order to improve back-end server 
performance, controls either the number of concurrently processed data requests or the 
number of concurrently supported connections. 

Although a lot of work has been done to improve the performance of cluster-based 
systems, there is scope for further improvement. For example, the existing cluster-based 
systems do not make effective use of the network bandwidth between the front-end and 
back-end nodes. Existing cluster-based systems do not fully exploit the benefits of 
multiple connections. The existing systems are generally configured statically and do not 
adapt to the changing workload on the system. Further, the existing systems are generally 
based on direct network subsystems (e.g. TCP/IP) and do not exploit the benefits of the 
mediated network subsystems (e.g. Java Messaging Service). 

SUMMARY 

A first aspect of the present invention provides a method for managing connections 
between data processing units of a data processing system. Concurrency benefits are 
provided by establishing multiple persistent connections between first and second data 
processing units of the system. The optimal number of connections between the data 
processing units depends on the load on the system (such as the number of concurrent
client requests) as well as the type of request (data-intensive / CPU-intensive) sent
between the data processing units. The method includes the steps of monitoring
communication delays for requests transferred from a first data processing unit to a
second data processing unit of the system and, in response to the monitored 
communication delays indicating a predefined performance condition, modifying the 
number of persistent connections between the first and second data processing units. 

A 'connection' in the context of the present application comprises the physical set-up of a 
communication channel between the connection end-points. Establishing a 'connection' 
typically includes exchanging and storing an identification of the addresses of the 
connection endpoints and the communication port numbers to be used, and reserving 
resources for use in communications via the connection - such as system memory and
buffer storage areas. Subject to the available communication bandwidth, there may be a 
large number of connections defined for use over a single physical link between two 
computer systems. A 'persistent connection' is a connection which persists across 
multiple requests. 

A first embodiment of the invention provides a method for managing persistent 
connections between data processing units of a computer system, wherein a first data 
processing unit is connected to a second data processing unit to send requests to the 
second data processing unit for processing, the method comprising the steps of: 

monitoring a communication delay period for requests transferred from the first
data processing unit to the second data processing unit;

comparing the monitored delay period with a threshold delay period to determine
whether the monitored delay period indicates a predefined performance condition; and,

in response to determining that the monitored delay period indicates the
predefined performance condition, adjusting the number of connections between the first
and second data processing units.

The method can be applied within cluster-based data processing systems for managing
the number of persistent connections between a front-end 'dispatcher' or 'gateway' node
and each of a cluster of back-end processing nodes of the system.

The communication delays can be calculated so as to exclude processing times at the
second data processing unit, as follows. The communication delays monitored according 
to one embodiment of the invention are calculated as a difference between a first 
timestamp generated when sending a request from the first data processing unit and a 
second timestamp generated when a response from the second data processing unit is 
received at the first data processing unit, minus the time actually spent processing the request
(which is measured at the second data processing unit). This calculated time period,
corresponding to the difference between timestamps minus processing time, is termed the
queuing delay. The effect on the queuing delay of the size and other characteristics of an
individual request is not as great as the effect on total response times.
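
As a minimal sketch of the calculation described above, the queuing delay could be derived from the two timestamps and the reported processing time as follows. The function name, the argument names and the clamp at zero are illustrative assumptions and are not part of the described method.

```python
def queuing_delay(sent_ts: float, received_ts: float, processing_time: float) -> float:
    """Return the round-trip time seen by the first unit minus back-end processing time.

    All values are in seconds. sent_ts and received_ts are both read at the
    first data processing unit; processing_time is measured and reported by
    the second data processing unit.
    """
    if received_ts < sent_ts:
        raise ValueError("response timestamp precedes request timestamp")
    round_trip = received_ts - sent_ts
    # Assumption: clamp at zero so that timer granularity never yields a negative delay.
    return max(round_trip - processing_time, 0.0)
```

Because both timestamps are generated at the first data processing unit, this calculation does not depend on the clocks of the two units being synchronized.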

In a particular embodiment, the method is responsive to the monitored delay period 
exceeding a first threshold delay period to initiate establishment of at least one additional 
connection between the first and second data processing units (subject to the number of 
connections not exceeding a maximum, above which performance degrades for some 
load levels). The first threshold delay period is preferably determined as a value 
representing a minimum delay period for which the addition of one or more connections 
can reduce communication delays by an amount justifying the addition. 

In one embodiment, a second threshold delay period is also defined and, in response to 
determining that the monitored delay period is less than the second threshold delay 
period, at least one connection between the first and second data processing units is 
deleted (subject to retaining at least one connection). The second threshold delay period 
is preferably identified as a delay period below which one or more connections can be 
deleted without increasing the delay period by an unacceptable amount. 
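
The two-threshold behaviour described above can be illustrated with a short sketch. The function and parameter names are hypothetical, the step size of one connection is only one of the options the text allows ("at least one"), and the check that the delete threshold lies below the add threshold is an added assumption intended to avoid oscillation.

```python
def adjust_connection_count(
    current: int,
    delay: float,
    add_threshold: float,
    delete_threshold: float,
    max_connections: int,
) -> int:
    """Return the new number of persistent connections for one monitoring step."""
    if delete_threshold >= add_threshold:
        # Assumption: the thresholds must not overlap.
        raise ValueError("delete_threshold must be below add_threshold")
    if delay > add_threshold and current < max_connections:
        return current + 1  # add a connection, subject to the maximum
    if delay < delete_threshold and current > 1:
        return current - 1  # delete a connection, always retaining at least one
    return current  # delay is between the thresholds, or a bound was reached
```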

A second embodiment of the invention provides a data processing system comprising: 

a first data processing unit for receiving requests from a client requestor and 
passing received requests to a second data processing unit, and for receiving responses 
from the second data processing unit and forwarding received responses to the client 
requestor;

a second data processing unit for processing requests received from the first data
processing unit to generate responses, and for forwarding the responses to the first data 
processing unit; and 

a connection manager for managing the number of connections between the first 
and second data processing units, the connection manager being responsive to monitored
communication delays between the first and second data processing units indicating a 
predefined performance condition to modify the number of persistent connections 
between the first and second data processing units. 

A third embodiment of the invention provides a data processing system comprising:

a network subsystem; 

a gateway node for receiving requests from a client requestor and passing 
received requests to the network subsystem for delivery to one of a set of back-end 
processing nodes, and for receiving responses from the back-end processing nodes via the 
network subsystem and forwarding received responses to the client requestor;

a set of back-end processing nodes for processing requests received from the
gateway node via the network subsystem to generate responses, and for forwarding the
responses to the gateway node via the network subsystem; and

a connection manager for managing the number of connections between the
gateway node and each of the back-end processing nodes, the connection manager being
responsive to monitored communication delays between the gateway node and the back- 
end processing nodes indicating a predefined performance condition to modify the 
number of persistent connections between the gateway node and at least a first one of the 
back-end processing nodes. 

In one embodiment, communication delays are monitored and averaged for the set of 
back-end nodes and an equal number of connections is provided between each back-end 
node and the gateway node. In a preferred embodiment, the connection manager is 
responsive to the monitored communication delays exceeding a first delay threshold to 
increase the number of persistent connections between the gateway node and each back-end
processing node, and is responsive to the monitored communication delays being less
than a second delay threshold to decrease the number of persistent connections between
the gateway node and each back-end processing node. 
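
A minimal sketch of this averaged variant is given below. It assumes that all delay samples from all back-end nodes are pooled into a single mean, which is only one possible reading of "averaged for the set of back-end nodes". The function name and its parameters are hypothetical.

```python
def rebalance(
    nodes: list[str],
    per_node: int,
    delays_by_node: dict[str, list[float]],
    add_threshold: float,
    delete_threshold: float,
    max_per_node: int,
) -> dict[str, int]:
    """Return an equal connection count for every back-end node."""
    # Assumption: pool every sample from every node into one average.
    samples = [d for node in nodes for d in delays_by_node.get(node, [])]
    if samples:
        average = sum(samples) / len(samples)
        if average > add_threshold and per_node < max_per_node:
            per_node += 1
        elif average < delete_threshold and per_node > 1:
            per_node -= 1
    # Every node receives the same number of connections to the gateway.
    return {node: per_node for node in nodes}
```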

A further embodiment provides a method for managing persistent connections between a 
gateway node and each of a set of back-end processing nodes arranged in a cluster within 
a data processing system, the method comprising the steps of: 

monitoring a communication delay period for requests transferred from the 
gateway node to the back-end processing nodes; 

comparing the monitored communication delay period with a threshold 
communication delay period to determine whether the monitored communication delay 
period indicates a predefined performance condition; and, 

in response to determining that the monitored communication delay period 
indicates a predefined performance condition, adjusting the number of connections 
between the gateway node and at least one of the set of back-end nodes. 

A further embodiment provides a computer program product, comprising program code 
recorded on a recording medium for controlling operations on a data processing system 
on which the program code executes, the program code comprising a connection manager 
for managing the number of connections between a first data processing unit and a 
second data processing unit of the system by: 

monitoring a communication delay period for requests transferred from the first 
data processing unit to the second data processing unit;

comparing the monitored communication delay period with a threshold 
communication delay period to determine whether the monitored communication delay 
period indicates a predefined performance condition; and, 

in response to determining that the monitored delay period indicates a predefined 
performance condition, adjusting the number of connections between the first and second 
data processing units. 

Methods, systems and computer programs according to embodiments of the invention can 
be used to modify the number of connections between system nodes in response to
varying load on a system, for systems which use either a mediated or a direct connection
network subsystem. Also, methods and systems according to embodiments of the 
invention can work in conjunction with existing solutions - for example in conjunction 
with load balancing or in conjunction with adaptive containers - to further improve the 
performance of the system.

BRIEF DESCRIPTION OF DRAWINGS 

Embodiments of the invention are described below in detail, by way of example, with 
reference to the accompanying drawings in which:

Figure 1 is a schematic representation of a cluster-based system architecture; 

Figure 2A is a graphical representation of the variation in throughput with the number of 
connections, showing the effect of concurrency on performance;

Figure 2B is a graphical representation of the variation in queuing delay with the number 
of connections, showing the effect of concurrency on performance; 

Figure 3 shows an upper bound on the number of connections, beyond which throughput
is not improved by increasing the number of connections; 

Figure 4 represents a methodology for computing the queuing delay, according to an 
embodiment of the invention; 

Figure 5 shows a system architecture according to an embodiment of the invention; and 

Figure 6 shows the steps of a method for adding or deleting connections, according to an 
embodiment of the invention. 

DETAILED DESCRIPTION OF EMBODIMENTS

Some portions of the following description are explicitly or implicitly presented in terms 
of algorithms and symbolic representations of operations on data within a computer 
memory. These algorithmic descriptions and representations are the means used by those
skilled in the data processing arts to most effectively convey the substance of their work 
to others skilled in the art. An algorithm is conceived to be a self-consistent sequence of 
steps leading to a desired result. The steps are those requiring physical manipulations of 
physical quantities. Usually these quantities take the form of electrical or magnetic 
signals capable of being stored, transferred, combined, compared, and otherwise
manipulated. It is often convenient to refer to these signals as bits, values, elements, 
symbols, characters, numbers, or the like. 

However, the above and similar terms are to be associated with the appropriate physical 
quantities and are merely convenient labels applied to these quantities. Unless
specifically stated otherwise, it will be appreciated that throughout the present 
specification discussions utilising terms such as "computing", "calculating", 
"determining", "comparing", "generating", "selecting", "outputting", or the like, refer to 
the action and processes of a computer system, or similar electronic device, that 
manipulates and transforms data represented as physical (electronic) quantities within the
registers and memories of the computer system into other data similarly represented as 
physical quantities within the computer system memories or registers or other such 
information storage, transmission or display devices. 

The present specification also discloses apparatus for performing the operations of the
methods. Such apparatus may be specially constructed for the required purposes, or may 
comprise a general purpose computer or other device selectively activated or 
reconfigured by a computer program stored in the computer. The algorithms presented 
herein are not inherently related to any particular computer or other apparatus. Various
general purpose machines may be used with programs in accordance with the teachings
herein. Alternatively, the construction of more specialised apparatus to perform the
required method steps may be appropriate. 

In addition, the present specification also discloses a computer readable medium 
comprising a computer program for performing the operations of the methods. The
computer readable medium is taken herein to include any transmission medium for
communicating the computer program between a source and a destination. The
transmission medium may include storage devices such as magnetic or optical disks,
memory chips, or other storage devices suitable for interfacing with a general purpose
computer. The transmission medium may also include a hard-wired medium such as
exemplified in the Internet system, or wireless medium such as exemplified in the GSM
mobile telephone system. Any computer program described herein is not intended to be
limited to any particular programming language and implementation thereof. It will be
appreciated that a variety of programming languages and coding thereof may be used to
implement the teachings of the disclosure contained herein.

Where reference is made in any one or more of the accompanying drawings to steps or
features which have the same reference numerals, those steps or features have the same
function(s) or operation(s) for the purposes of this description, unless the contrary
intention appears.

Cluster Overview 

Figure 1 shows a typical cluster-based system architecture in which the present invention 
may be implemented. A front-end node 10 acts as a gateway to the data processing nodes 
of the system, and is connected to the back-end nodes 20 by a network subsystem 30. The
front-end node receives requests from clients 50 and is responsible for distributing the 
incoming requests to the back-end in a manner that is transparent to clients. The back-end 
nodes 20 carry out the main processing of the requests, accessing databases 40 as 
required, and return a response to clients 50 either directly or via the front-end node 10. 

The present invention is not limited to specific data processing hardware, and the various
architectural components shown in Figure 1 may be implemented by any suitable data
processing apparatus - including separate computing systems or separate data processing 
units integrated within a parallel processing system (for example, back-end nodes 20 may 
be separate processing units of a single system). The invention is, however, applicable to 
clustered Web servers in which Web server software is running on each of a plurality of
back-end server computers. In that case, each of the back-end computers 20 cooperates
with a gateway server 10.

The front and back end nodes are connected to each other by a network subsystem 30
which can be either a direct connection network subsystem or a mediated connection
network subsystem. In the direct connection subsystem (such as in a TCP/IP
implementation), a persistent connection is open between the front and back end nodes;
whereas in a mediated connection subsystem, the connection is via a broker sitting
between the front and back end nodes. A persistent connection is one in which the
connection persists across multiple requests, such that it is not necessary to create a new
connection for each request. The front end and back end nodes each act as both a
producer and consumer of messages.

A number of generic functions such as authentication, access control, load balancing, 
request distribution and resource management are handled by the front-end node. The
front-end node runs a multi-threaded server providing these functions (i.e. a server which
can handle concurrent requests), and leaves the back-end node to focus on the core 
service functions. Typically, the front-end distributes the requests such that the load 
among the back-end nodes is balanced. 

With content-based request distribution, the front-end additionally takes into account the
content or type of service requested when deciding to which back-end node a client
request should be assigned. Each back-end node is a multi-threaded server (Container)
and provides the core service functionality. The generic term "Container" is used herein
to denote the back-end node as it contains the Web service and is a generic server that
may be listening on HTTP, TCP/IP or JMS connections.

System Architecture

An example system architecture is described below. A gateway is connected to a backend 
cluster to enable communications between them using push/pull methodology, and using 
a simple round robin algorithm for load balancing across the cluster of backend nodes. 

The system includes the following components: 

o A Broker that facilitates "push/pull" communication.

o A Gateway that accepts requests from clients over HTTP. The gateway:
    o pushes the incoming requests to the broker by publishing the request on a
      specific Request Topic;
    o pulls the response from the broker by subscribing to a corresponding
      Response Topic on the broker; and
    o returns the response to the client over the open HTTP connection.

o A cluster of back-end nodes. Each back-end node is a server that:
    o deploys/contains the Web Service;
    o pulls requests from the broker by subscribing to the Request Topic and
      invokes the relevant web service; and
    o pushes the response from the web service to the broker by publishing the
      response on the Response Topic.

In push/pull systems, once a publisher (or subscriber) establishes a topic for publishing 
(or subscribing), the publisher (or subscriber) keeps a persistent connection with the 
broker and pushes (pulls) the data over the persistent connection. 
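
The push/pull exchange described above can be sketched in a few lines. This is only an illustrative in-memory model: the `Broker` class, its `publish` and `pull` methods and the `backend_step` helper are hypothetical names, and a real mediated subsystem would hold persistent network connections to the broker rather than local queues.

```python
from collections import defaultdict, deque
from typing import Callable, Optional


class Broker:
    """Toy in-memory broker with one FIFO queue per topic (illustrative assumption)."""

    def __init__(self) -> None:
        self._topics = defaultdict(deque)

    def publish(self, topic: str, message: str) -> None:
        """Push a message onto the named topic."""
        self._topics[topic].append(message)

    def pull(self, topic: str) -> Optional[str]:
        """Pull the oldest pending message from the topic, or None if it is empty."""
        queue = self._topics[topic]
        return queue.popleft() if queue else None


def backend_step(
    broker: Broker,
    request_topic: str,
    response_topic: str,
    service: Callable[[str], str],
) -> bool:
    """Pull one request, invoke the service and publish its response.

    Returns False when no request was pending on the Request Topic.
    """
    request = broker.pull(request_topic)
    if request is None:
        return False
    broker.publish(response_topic, service(request))
    return True
```

In this model the gateway would call `publish` on the Request Topic and `pull` on the corresponding Response Topic, mirroring the back-end node's use of the two topics.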

Let there be N back-end containers C_1, C_2, ..., C_N. Initially, there is exactly one Request
Topic "ReqTopicC_i^1" and one Response Topic "ResTopicC_i^1" per back-end container, C_i.
Thus, there is one persistent connection between the Gateway and each back-end
container via the broker. As each request is received by the gateway from the clients, the
gateway publishes the first request on the topic "ReqTopicC_1^1", the second request on the
topic "ReqTopicC_2^1", and so on until the Nth request is published on "ReqTopicC_N^1". The
(N+1)th request will then be published again on "ReqTopicC_1^1", and so on in round-robin
fashion.
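
The round-robin assignment for the initial case of one connection per container can be sketched as follows. The function names are hypothetical, and the flat topic strings of the form `ReqTopicC<i>_<k>` (container index, then connection index) are an assumed ASCII rendering of the topic names used in the text.

```python
from itertools import cycle, islice


def initial_request_topics(num_containers: int) -> list[str]:
    """One Request Topic per back-end container, all on connection number 1."""
    return [f"ReqTopicC{i}_1" for i in range(1, num_containers + 1)]


def round_robin_topics(num_requests: int, topics: list[str]) -> list[str]:
    """Return the topic used for each successive request, wrapping after the last topic."""
    return list(islice(cycle(topics), num_requests))
```

With three containers, the fourth request returns to the first container's topic, which matches the (N+1)th request in the description.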

When the load increases, several requests may be pipelined on the same connection. That 
is, several requests are sent via the same connection without waiting for the gateway to
complete its execution of a request before the next request is sent. Similar pipelining is
used for responses. Under heavy loads, the connection pipeline may become 'full' - that
is, the computer memory and buffers reserved for the connection become full and no
further requests (or responses) can be added until existing requests (or responses) have
been processed. This leads to a delay in transferring the request (or response) onto the
connection pipeline, and a consequent delay in transferring the request from the gateway
to the back-end node and the response from the back-end to the gateway. This delay is termed the
queuing delay.

To reduce the queuing delay, the number of persistent connections between the gateway
and each back-end container may be increased. For example, if the number of persistent
connections is increased to 3, then the Request topics used will be "ReqTopicC_i^1",
"ReqTopicC_i^2" and "ReqTopicC_i^3" for back-end container C_i, and similarly for the
remaining containers. Corresponding Response topics are also generated. This is
described in more detail below. The establishment of each new connection involves
reserving system memory and buffer space and exchanging communication port numbers
and network addresses for use in the transfer of data between the connection end points.
Establishment of a new connection can also involve defining the communication protocol
and quality of service parameters (in some systems) or registering with a broker (in
mediated connection systems). Deleting a connection returns the reserved resources to the
system for other uses. The allocation of requests to the multiple connections uses a
round-robin approach - although the invention is not limited to any specific workload sharing or
load balancing algorithm.
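
As an illustration of how the topic set grows with the number of connections, the sketch below generates the Request and Response topic names for one container. The function name and the flat naming scheme `ReqTopicC<i>_<k>` / `ResTopicC<i>_<k>` are assumptions made for this example only.

```python
def container_topics(container: int, num_connections: int) -> tuple[list[str], list[str]]:
    """Return (request_topics, response_topics) for one back-end container.

    container is the container index i; num_connections is the number of
    persistent connections between the gateway and that container.
    """
    if num_connections < 1:
        raise ValueError("at least one connection must be retained")
    indices = range(1, num_connections + 1)
    request_topics = [f"ReqTopicC{container}_{k}" for k in indices]
    response_topics = [f"ResTopicC{container}_{k}" for k in indices]
    return request_topics, response_topics
```

Calling `container_topics(1, 3)` yields three Request topics and three matching Response topics for the first container, corresponding to the three-connection example in the text.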

Performance Characteristics

The performance of the system is measured in terms of throughput - the total number of
requests served by the system in a given amount of time. The time over which throughput
is measured is termed the cycle time. Throughput is influenced by the following three 
parameters: 

• Number of concurrent clients; 

• Length of the message (response from the container) exchanged between front-end
and back-end nodes for each request; and

• Computational activity (CPU load) required for each request at the back end node 
(container). 

The first of these parameters loads the entire system whereas the other two parameters 
load the back-end containers. It has been observed that an increase in message size (such 
as when handling increasingly data-intensive requests) or an increase in computational 
activity (such as when handling increasingly compute-intensive requests) essentially has 
the same effect - both types of loading of the container result in a decrease in throughput. 
As the load on the system is increased, the pipeline between the gateway and the back- 
end node becomes utilized to its full capacity. This results in an inability to increase 
throughput via the existing connection, with a consequent increase in the queuing delay. 

As described in detail below (in section 'Specific System Architecture and Algorithm'), it 
is possible to increase the number of connections within the constraints of the available 
bandwidth (which is substantial for a typical cluster-based system). This can reduce the 
queuing delay by increasing the number of persistent connections between the back-end 
and front-end nodes, thereby improving the overall system performance. Figure 2 shows 
the effect of the number of persistent connections on throughput for heavy request loads. 

In particular, Figure 2A shows the effect of concurrency on performance in terms of the 
variation in throughput according to changes in the number of connections. Each curve 
represents a different number of persistent connections as follows: Curve A represents 1 
connection; B represents 5 connections; C represents 15 connections; D represents 20 
connections; E represents 30 connections; F represents 60 connections; and G represents
90 connections.

Figure 2B shows the effect of concurrency on performance, in terms of the variation in 
queuing delay according to the number of connections. In Figure 2B, the labelled curves 
each represent a different number of persistent connections as follows: A=1, B=5, C=15,
D=20, E=30, F=60, and G=90. 

As shown in Figure 2A, the throughput of the system for a given number of concurrent 
clients increases with increases in the number of persistent connections, up to a point 
beyond which it starts to degrade (as exemplified by curve G). Similar behavior is 
reflected in Figure 2B - the queuing delay reduces as a result of increasing the number of 
connections - up to a point beyond which it again increases (as exemplified by curve G). 
Thus, there is an upper bound on the number of connections beyond which increasing the 
number of connections brings no further performance benefits and can be detrimental. 
This deterioration of performance and increase of queuing delay if the number of 
connections is allowed to exceed a maximum number is partly due to contention for 
resources such as available buffer storage/system memory, and partly because of 
processing overheads such as synchronization. This behavior is shown more clearly in
Figure 3.

Although Figures 2 and 3 show experimental results, their inclusion within this 
specification is by way of example only, to show qualitative variations and typical 
behaviours. In particular, numerical values represented by the curves in Figures 2 and 3 
are merely exemplary and Figures 2 and 3 are not representative of the same
experimental data (and so there are differences between the values represented
graphically in the figures). 

In Figure 3, the labelled curves each represent a different number of concurrent clients, as 
follows: P=10; Q=50; R=100; S=200; and T=250. For a given number of concurrent
clients, the throughput initially increases as the number of connections increases. Beyond
a given number of connections, the throughput rapidly decreases with increases in the
number of connections. For example, as shown in curve T in Figure 3 which represents
250 concurrent clients, throughput rapidly increases with an increased number of 
connections from 1 to 20. Beyond 20 connections, the throughput rapidly falls. 

For light requests (which are neither CPU-intensive nor data-intensive), increasing the
number of connections does not give substantial benefits, and the bound on the
number of connections is smaller than for heavy requests (which are CPU-intensive,
data-intensive or both).



Monitoring System Performance

The performance characteristics of the cluster based system indicate that queuing delay
(or 'network delay') is a significant determinant of the overall performance of the system
- irrespective of the load or the type of load. Concurrency benefits provided by multiple
connections (within bounds) can be exploited to reduce the queuing delay and improve
the overall performance of the system. In particular, an adaptive system can vary the
number of connections as the queuing delay changes to operate in the optimal zone. In
the present embodiment, queuing delay is used as the monitored performance
characteristic and the number of persistent connections is used as the control parameter to
improve the overall performance of the system. The queuing delay (d) is computed at the
gateway using the following methodology, as shown in Figure 4.

Computing the queueing delay (d) 

Referring to Figures 1 and 4, the front-end (gateway) 10 receives a request from a client
50 and timestamps the request (with timestamp TS1) before forwarding the request to the
back-end 20. The timestamp is stored at the gateway 10 as local data in the controlling
thread. The back-end receives the request and processes it, and keeps track of the
processing time. The back-end 20 sends back data indicating the time taken to process the
request along with the response. On receiving the response, the front-end takes a second
timestamp (TS2) and computes the queuing delay using the following equation:

d = (TS2 - TS1) - Processing time

The gateway maintains two variables "TotalDelay" and "Count" which are initially set to
zero. For each completed request, "TotalDelay" is incremented by the value of the delay
and "Count" is incremented by 1. At the end of a cycle, the average queuing delay (dAV)
is calculated by dividing "TotalDelay" by "Count".
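
The bookkeeping described above can be sketched in a few lines of Python. This is a
minimal illustration only, not the implementation of the embodiment: the class name
DelayMonitor and its method names are hypothetical, and resetting the two counters at
the end of each cycle is an assumption (the text states only that they start at zero).

```python
class DelayMonitor:
    """Accumulates per-request queuing delays over one cycle (hypothetical helper)."""

    def __init__(self):
        self.total_delay = 0.0  # "TotalDelay" in the text
        self.count = 0          # "Count" in the text

    def record(self, ts1, ts2, processing_time):
        """Record one completed request and return its queuing delay d."""
        d = (ts2 - ts1) - processing_time
        self.total_delay += d
        self.count += 1
        return d

    def end_cycle(self):
        """Return the average delay for the cycle, or None if no request completed.

        Resetting the counters here is an assumption made for this sketch.
        """
        average = self.total_delay / self.count if self.count else None
        self.total_delay = 0.0
        self.count = 0
        return average
```

Returning None for an empty cycle avoids a division by zero when no request completed
during the cycle; how the embodiment treats that case is not stated in the text.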

To correlate the queuing delay with the number of connections, the system is calibrated
(or "benchmarked"). This calibration involves computing a threshold value (t) for the
average queuing delay beyond which it is possible to improve the performance by adding
extra connections. There is also an upper bound (h) beyond which adding further
connections actually degrades the performance, so the system is also calibrated to
compute the upper bound. The methods for computing the threshold and the upper bound
on the number of connections are described below.

Computing the queuing delay threshold (t) 

Figure 2 shows the variation of throughput and queuing delay with client loads for
various different numbers of connections. The system is calibrated by plotting throughput
and queueing delay against each of the number of connections and the number of
concurrent clients (such as in Figures 2A, 2B and 3), for both data-intensive and
CPU-intensive requests. For the throughput graph, the performance curves for 1 connection
and for (1+a) connections are compared, and a minimum client load is identified at which
there is an x% difference in throughput. 'x' is a configuration parameter corresponding to
the minimum percentage difference which justifies modification of the number of
connections; x will typically vary from system to system. In the present embodiment,
factors used in the selection of x include the stability of the system and the desirability of
minimizing oscillations in the number of connections. If the value of x is too low, the
state of the system is frequently disturbed (for each increment of x% in the throughput). For
smaller numbers of concurrent clients, modifying the number of connections will not
provide significant benefits.

The performance curves shown in Figure 2A are generated from a set of discrete points,
(load, throughput) at various client loads (number of concurrent clients) for a respective
fixed number of connections, 1, 1+a, 1+2a, etc. Let us consider the set of discrete
points for 1 and 1+a connections. For each client load, the percentage increase in
throughput from 1 connection to 1+a connections is found. The increase in throughput is
compared with the desired value x, and the corresponding minimum client load at which
we get this increase is identified. If there is no exact match, two points are identified such
that the desired increase in throughput x lies between the two points. A linear
interpolation technique is used to identify the minimum client load at which an x%
increase in throughput is achieved by changing from 1 connection to 1+a connections.
The selected load point is projected onto the queuing delay curve to find the value of the
queuing delay for 1 connection (since the system is initially configured with a single
connection to each back-end server). This sequence of steps can be implemented in
computer program code, as will be recognized by persons skilled in the art.

The performance curve as shown in Figure 2B is a set of discrete points, (load, queuing
delay) at various client loads for a fixed number of connections, 1, 1+a, 1+2a, etc. For
the selected load value, the queuing delay is found from Figure 2B. If the value for the
selected load point is not directly available then two points are identified such that the
selected load value lies in between the two points. A linear interpolation technique is used
to find the queuing delay for the selected load.
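
One possible coding of the two interpolation steps described above is sketched below in
Python. It is illustrative only. The function names and argument layout are hypothetical,
the calibration points are assumed to be held in plain lists sorted by ascending client
load, and the throughput gain is assumed to grow with load between neighbouring points.

```python
def interpolate(x0, y0, x1, y1, y):
    """Return the x at which the line through (x0, y0) and (x1, y1) reaches y."""
    return x0 + (y - y0) * (x1 - x0) / (y1 - y0)


def min_load_for_gain(loads, tput_1, tput_1a, x):
    """Smallest client load at which 1+a connections give an x% throughput gain.

    tput_1 and tput_1a hold the throughput measured at each load for 1 and
    1+a connections. Returns None if an x% gain is never reached.
    """
    gains = [100.0 * (b - a) / a for a, b in zip(tput_1, tput_1a)]
    for i, gain in enumerate(gains):
        if gain >= x:
            if i == 0 or gain == x:
                return float(loads[i])  # exact match, or already reached at the lowest load
            # x lies between the gains at two neighbouring load points
            return interpolate(loads[i - 1], gains[i - 1], loads[i], gain, x)
    return None


def delay_at_load(loads, delays, load):
    """Queuing delay at `load`, linearly interpolated on the 1-connection curve."""
    for i in range(1, len(loads)):
        if loads[i - 1] <= load <= loads[i]:
            frac = (load - loads[i - 1]) / (loads[i] - loads[i - 1])
            return delays[i - 1] + frac * (delays[i] - delays[i - 1])
    raise ValueError("load lies outside the calibrated range")
```

Calling min_load_for_gain on the throughput points and then delay_at_load on the queuing
delay points for 1 connection yields the projected threshold delay for one request type.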

This procedure is followed, projecting the load point on the queuing delay curve to
provide the threshold queuing delay for CPU-intensive requests, tc, and the threshold
queuing delay for data-intensive requests, td. The minimum of these two values, tcd, is
then computed:

tcd = minimum (tc, td)

The approach of taking a minimum for different load conditions provides a computed
value, tcd, which is the lowest projected threshold queuing delay for CPU-intensive and
data-intensive request loads.

From the queuing delay curves (as shown in Figure 2B), the minimum values of queuing
delay for the CPU-intensive (dcm) and data-intensive (ddm) cases are also identified. The
value of the queuing delay threshold, t, is then determined as the maximum of tcd, dcm,
and ddm:

t = maximum (tcd, dcm, ddm)
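
The two-stage selection of the threshold can be written directly as code. The Python
function below is a sketch with a hypothetical name; tc and td are the projected
thresholds for CPU-intensive and data-intensive requests, and dcm and ddm are the
minimum queuing delays observed for those two request types.

```python
def delay_threshold(tc, td, dcm, ddm):
    """Queuing delay threshold t from the calibration values (illustrative sketch)."""
    tcd = min(tc, td)           # lowest projected threshold across request types
    return max(tcd, dcm, ddm)   # never below a delay the system can actually achieve
```

When the projected minimum falls below one of the observed minimum delays, the function
returns that observed minimum instead of the projected value.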

This taking of a maximum is to safeguard against a potential problem which could arise if
the lowest projected queueing delay threshold, tcd, is lower than the minimum for
data-intensive and CPU-intensive cases determined separately. If the threshold queuing delay
were lower than the minimum for data-intensive and CPU-intensive cases, although this is
unlikely, the system might never achieve a steady state - the number of connections would be
increased repeatedly in an attempt to bring the monitored value of the queuing delay
down to an unachievable value, until the number of connections reaches an upper bound.
Taking a maximum after determining the projected minimum, tcd, avoids this potential
problem.

Modifications to the number of connections in accordance with this embodiment will
tend towards a determined optimum number of connections but may not reach it; this is
considered acceptable to ensure the modifications do not degrade performance. In
practice, typical load conditions include a mix of both CPU-intensive and data-intensive
requests and the request type of each request is not known in advance of processing the
request.

Computing the upper bound on the number of connections (h) 

Figure 3 shows the variation of throughput with the number of connections for various
client loads. The system is calibrated to get similar graphs for data-intensive requests and
CPU-intensive requests. The point at which the throughput is at a maximum is identified,
and projected to get the corresponding value for the number of connections.

The performance curve as shown in Figure 3 is a set of discrete points, (connections,
throughput) at different numbers of connections for a fixed client load. For each client load,
the point at which the throughput is a maximum is identified and the corresponding number of
connections is noted. The minimum of these values across all the curves is then found.

This procedure is followed to get the upper bound, hc, for the CPU-intensive requests and
the upper bound, hd, for the data-intensive requests. The upper bound on the number of
connections is selected as the minimum of the two values.

h = minimum (hc, hd)
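
The selection of the upper bound can be sketched as follows in Python. The function names
and the dictionary layout (client load mapped to a list of (connections, throughput)
points) are assumptions made for illustration and are not taken from the embodiment.

```python
def upper_bound_for_type(curves):
    """Upper bound for one request type.

    curves maps a client load to its list of (connections, throughput) points.
    """
    peaks = []
    for points in curves.values():
        # number of connections at which this curve reaches its maximum throughput
        best_connections, _ = max(points, key=lambda p: p[1])
        peaks.append(best_connections)
    return min(peaks)  # most conservative peak across all client loads


def upper_bound(cpu_curves, data_curves):
    """h = minimum (hc, hd)."""
    return min(upper_bound_for_type(cpu_curves), upper_bound_for_type(data_curves))
```

Taking the minimum at both stages means that h never exceeds the peak of any calibrated
curve, so no curve is pushed onto its falling side.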

Specific System Architecture and Algorithm 

The following description relates to an adaptive cluster-based system which monitors the
network delay between the gateway and the container and adds/deletes connections based
on this value. As explained above, the system is initially calibrated to determine threshold
values.

As shown in Figure 5, the runtime infrastructure includes a monitor 100 and a gateway
component 110 at the front-end (gateway) system 10. A benchmark component 120 holds
configuration and calibration information for the gateway 110. As described above with
reference to Figure 4, the monitor 100 keeps track of the queuing delay, d, for each
request by timestamping requests when sent and timestamping responses when received
and subtracting the processing time measured by a monitor 130 in the back-end system
20. This monitored processing time is returned to the front-end system with the response.
The monitor 100 also averages the queueing delays over a period of time (the cycle time).
The cycle time is a configuration parameter which can be varied. The average delay is
compared with the threshold value, t, to decide whether to add/delete a connection. If the
average queuing delay, dAV, is within y% of the threshold (t - (yt/100) < dAV < t + (yt/100)),
the system is left undisturbed. If the average delay differs from the threshold delay, t, by
more than y%, the number N representing a number of connections is changed.

The change, a, in the number of connections is in steps, for example adding or deleting 2
connections at a time (a=2), with the value of a chosen to reduce the tendency for the
number of connections to oscillate. The value of the number of connections, N, is
checked to see if it is between 1 (the minimum number of connections) and the upper
bound, h. If the value is within these bounds, a decision is made regarding whether to
add/delete one or more connections. After a decision to add or delete connections, the
gateway and the container follow a protocol to add/delete connections as described
below. The same process is then repeated at regular intervals (the cycle time).

The sequence of method steps for adapting the number of connections to the current 
system load is described below with reference to Figure 6. 

Protocol to add or delete a connection 

In a direct connection network subsystem, such as TCP/IP, although we keep multiple
parallel connections open, the server address does not change for each connection. On the
other hand, in a mediated connection subsystem such as JMS, each subscriber requires a
separate topic and so the front-end and back-end need to co-ordinate the topic names. To
solve the naming problem, control is retained at the front-end 10 (see Figure 5). If the
front-end 10 determines a need for an extra connection, the front-end sends a message to
the back-end 20 with the desired topic. The back-end 20 starts a new subscriber on the
given topic and sends an acknowledgement to the front-end 10. The front-end can then
start sending data on the new connection.

As shown in Figure 6, the process of managing the number of persistent connections
starts 200 with a single connection (N=1). When a cycle period expires, the average
queueing delay, dAV, for the period is computed 210 by the monitor in the front-end
system 10, using the queuing delays calculated and saved for each request during the
cycle period. A determination is performed 220 of whether the computed average queuing
delay, dAV, is within a predefined percentage, y%, of the threshold queueing delay, t. If
dAV is within y% of t, the potential benefits of modifying the number of connections are
deemed not to justify performing the task. Reducing the number of connections could
reduce the throughput somewhat and yet increasing the number of connections will not
provide substantial benefits (according to the specified x% level).

If the determination in step 220 is negative, such that the average queuing delay, dAV,
differs from the threshold queuing delay, t, by more than y%, a determination is
performed 230 of whether dAV is greater than t. If dAV is greater than t, the value of N
(representing the current number of connections) held in a register is increased 240 by an
integer value, a, (by setting N=N+a) to obtain a new value representing a potential
increased number of connections. Before this new value is used to increase the number of
connections, a check is performed 250 of whether the new value of N is less than or equal
to the upper bound, h, on the number of connections. If N is less than or equal to h, the
number of connections is increased 260 to the new value of N by adding a connections. If
the new value of N is above the upper bound h (as determined in step 250), the value of N
is reset 270 to the upper bound h (setting N = h) and this new value of N is applied to
increase 280 the total number of connections to the upper bound h.

However, if the determination at step 230 determines that the average delay, dAV, is not
greater than the threshold value, t, (i.e. is less than t by at least y% in view of step 220),
the value of N is decremented 290 by an integer value, a. A check is performed 300 of
whether the new value of N is greater than or equal to 1.

If the new value N is determined at step 300 to be greater than or equal to 1, the actual
number of connections is reduced 310 by the integer value a, by deleting a connections.
The newly set value of N is now consistent with the actual number of connections in the
system. Alternatively, if the result of the determination at step 300 is that the new value
of N is determined to be less than 1, the value of N is reset again 320 (by setting N=1).
This new value of N is now applied 330 to the actual number of connections, deleting all
except one connection.
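
The decision logic of steps 220 to 330 of Figure 6 can be condensed into one function that
returns the target number of connections for the next cycle. The Python sketch below is
illustrative only: the function name and parameter names are hypothetical, and the actual
opening or closing of connections is left to the caller.

```python
def next_connection_count(n, d_av, t, h, y, a=2):
    """Return the number of connections to use after one cycle.

    n: current number of connections    d_av: average queuing delay for the cycle
    t: queuing delay threshold          h: upper bound on the number of connections
    y: tolerance around t, in percent   a: step size (connections added or deleted)
    """
    band = y * t / 100.0
    if t - band < d_av < t + band:   # step 220: within y% of t, leave undisturbed
        return n
    if d_av > t:                     # step 230: delay too high, add connections
        return min(n + a, h)         # steps 240-280: clamp to the upper bound h
    return max(n - a, 1)             # steps 290-330: never fewer than one connection
```

The caller compares the returned value with the current count and opens or closes the
difference using the protocol appropriate to the connection subsystem.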


Similarly, when a connection needs to be closed, the front-end 10 marks that connection
as redundant so that no more requests are sent on that connection. Then the front-end 10
sends a message to the back-end 20 with the identification of the last message sent on that
topic. When the back-end receives the indicated message, the back-end closes the
connection. When the front-end receives a response for that message, it also closes the
connection.

In the case of TCP/IP persistent connections, when a connection needs to be added, the
gateway opens a new connection with the back-end node and starts sending requests on
the connection. When a connection needs to be deleted, the gateway stops sending any
new requests on the connection. When all the responses on that connection have arrived,
the gateway closes the connection. The back-end detects the closure and automatically
closes its end of the connection also.
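
The gateway-side bookkeeping for deleting a TCP/IP connection (stop sending, wait for
outstanding responses, then close) might look like the Python sketch below. No real
sockets are used; the class GatewayConnection and its methods are hypothetical names
introduced only to illustrate the ordering of events.

```python
class GatewayConnection:
    """State of one persistent connection as seen by the gateway (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.outstanding = 0   # requests sent whose responses have not yet arrived
        self.draining = False  # True once the connection is marked for deletion
        self.closed = False

    def send(self):
        """Account for one request sent on this connection."""
        if self.draining or self.closed:
            raise RuntimeError("connection no longer accepts requests")
        self.outstanding += 1

    def mark_for_deletion(self):
        """Stop sending new requests; close at once if nothing is outstanding."""
        self.draining = True
        self._close_if_idle()

    def on_response(self):
        """Account for one response received on this connection."""
        self.outstanding -= 1
        self._close_if_idle()

    def _close_if_idle(self):
        if self.draining and self.outstanding == 0:
            self.closed = True  # a real gateway would close the socket here
```

Closing only after the outstanding count reaches zero ensures that no response is lost
when the number of connections is reduced.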


The above-described embodiment can work with any container. Thus an adaptive
container - which adapts (modifies) its configuration parameters based on the load on the
system to maintain the optimal performance - can be used in a system according to an
embodiment of the invention.

It will be recognized by persons skilled in the art that various modifications can be made
to the example process shown in Figure 6 within the scope of the present invention. For
example, in step 220, it is not essential that the determination of the extent to which the
average queuing delay, dAV, differs from the threshold value, t, is a determination of a
percentage difference; for example, maximum and minimum values of t may be used.
Furthermore, if a determination is made to continue the process for modifying the number
of connections when the average queueing delay exceeds the threshold value t by a
particular percentage value y%, a different percentage value z% may be appropriate when
considering the extent to which the average queueing delay, dAV, is less than the threshold
value, t.

Steps 220 and 230 of Figure 6 could be represented with equal validity as a single
determination step. Step 250 could precede step 240, with the performance of step 240
being conditional on the outcome of step 250, in which case step 270 is not required.
Similarly, step 300 could precede step 290 with the performance of step 290 being
conditional on the outcome of step 300, such that step 320 is not required. The number of
connections added, a, at step 260 may differ from the number of connections deleted at
step 310 (for example, if usage of the system is characterised by rapid workload increases
followed by a less rapid tail-off, or vice versa).

In the embodiment described above in detail, the queueing delay d was computed as a
difference between a timestamp on the request and a timestamp on the response, minus
the back-end processing time (d = (TS2 - TS1) - Processing time). In embodiments in
which back-end servers send responses direct to requestor clients without sending them
via a dispatcher at the front-end, a notification including the response timestamp TS2 can
still be provided to the front-end node to enable calculation of a queueing delay.

In an alternative embodiment, the monitored communication delay is the time between a
request being sent from the first data processing unit and the start of processing at the
second data processing unit. This requires clock synchronization between the first and
second data processing units, but does not require measurement of the processing time or
timestamping of responses (and therefore can be advantageous for systems in which the
second data processing unit sends responses directly to requestor clients without going
via the first data processing unit).

A further alternative embodiment monitors communication delays and modifies the
number of connections separately for each of a plurality of back-end nodes within a
cluster-based system. A separate monitor and connection manager is provided at the
gateway for each of the back-end nodes. Despite the additional complexity of such a
solution compared with solutions which modify the number of connections consistently
for each back-end node within the cluster, such a solution can be advantageous in a
cluster which has different types of connections between the gateway and different
back-end nodes.

Embodiments of the invention described above can be implemented in cluster-based Web
servers, Web application servers and in Web-hosting service implementations. A system
which is capable of adapting the number of persistent connections according to an
embodiment of the invention can maintain performance when experiencing workloads
which would cause degraded performance in many conventional systems.



